HLT4HRP Data Management

Michael Girdwood

Hi! 👋

My name is Mick.

  • I am a physiotherapist, researcher and PhD Student at La Trobe University.
  • I’ve worked across a variety of projects in musculoskeletal health, focusing at the moment on knee and hip injuries
  • I have a special interest in working with data
  • m.girdwood@latrobe.edu.au
  • ORCiD

Data Management

Why is data management important?

  • Minimise errors in research
  • Ensure safety and privacy of participants
  • Compliance with policy and law
  • Enable sharing with other organisations

What is research data?

Data management plan


Plan exactly what you will be collecting
before starting your data collection


  • What information?
  • How will the information be gathered?
  • In what format will the data be recorded?
  • How will you receive the data?
  • How will you store, combine or link the data?
  • How will you analyse your data?

Imagine what the end goal of your data is…

What information?

  • Consent (hopefully!)
  • Contact details
  • Health data
  • Personal information
  • Dates, times, locations
  • Images, video



= Any information collected, observed, generated or created in the process of your research

How will the information be gathered and recorded?

  • By the participant
    • anonymously vs linked responses
    • one off vs repeated measures
    • by invitation directly vs random sampling
  • By you (the researcher)
    • concurrently or later
    • manually filling out vs software operation
  • By another team member
    • external organisations
    • health practitioners

How will you receive the data?

  • Online
  • Download from a portal
  • Entered manually from paper 🥴
  • From external organisation
  • Scraped from online source/database







Data storage and linkage

KEY RESOURCES:

Electronic Data

  • Research DataSpace (RDS)
  • LTU Research Drives (P: Drive)

Hardcopy Data

  • Secure file storage (locked cabinets)

Not appropriate

  • Dropbox, Google Drive, OneDrive
  • Personal computer*

Data analysis







…coming soon - Quantitative Methods Workshop

Best case scenario!

A bad scenario…

Why Excel spreadsheets can be bad

  • They are editable
  • They can be overwritten
  • The biggest liability in the system (the human 🤦‍♀️) fiddles with them
  • Not great for large data storage
  • They store data in strange ways, which can cause headaches
  • They are rarely where any analysis is carried out

…but if you must…

Spreadsheet organisation

Be consistent!

  • Be consistent
  • Be consistent
  • Be consistent
  • Be consistent
  • Be consistent
  • Be consistent
  • Be consistent
  • Be consistent
  • Be consistent







Repeating the same mistake 10x is easier to fix than making 10 different mistakes

Be consistent - identifier

  • Each person/animal/thing you are studying will need a unique id
  • Use the same one everywhere
    • every file
    • every datasheet
    • every piece of software
example 1   example 2   example 3
k002        1859        LTU295
k003        1739        LTU304
k004        1069        LTU205
k005        1204        LTU395
k006        3801        LTU591
...         ...         ...
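
One way to check that the same identifier really is used everywhere is to compare the ID columns of different files. The sketch below is only an illustration: the two ID lists and both helper functions (`find_duplicates`, `unmatched`) are hypothetical and not part of the original material.

```python
# Hypothetical ID columns, as they might appear in two separate files
baseline_ids = ["k002", "k003", "k004", "k005", "k006"]
followup_ids = ["k002", "k003", "K004", "k005", "k007"]


def find_duplicates(ids):
    """Return IDs that appear more than once in a single list."""
    seen, dupes = set(), set()
    for value in ids:
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return sorted(dupes)


def unmatched(reference_ids, other_ids):
    """Return IDs in other_ids with no exact match in reference_ids."""
    return sorted(set(other_ids) - set(reference_ids))


print(find_duplicates(baseline_ids))          # []
print(unmatched(baseline_ids, followup_ids))  # ['K004', 'k007']
```

Note that `K004` is flagged because the comparison is exact: a change of case is enough to break the link between files.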

Be consistent - variable names

  • Use logical, easy-to-understand names
  • No white space in variable names, or special characters #$%)@-!

sex

height

date_surgery

Use consistent syntax:

Consistent:

l_leg_length

r_leg_length

dass_anxiety_1

dass_stress_2

Inconsistent:

Lleg_length

r_leg-length

dass_anxiety_q1

dass_scale_stress_2

  • all lower case can be useful
  • avoid numbers at start of string (e.g. 2nd_doctor_name)
  • stick to same ‘styling’ e.g. _ to separate info in variable naming
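
If you inherit a file with messy column headers, these naming rules can be applied automatically. This is a rough sketch, assuming a hypothetical helper called `clean_name`; the `x_` prefix for names starting with a number is an arbitrary choice for illustration.

```python
import re


def clean_name(name):
    """Convert a column header to lower-case snake_case."""
    name = name.strip().lower()
    # Replace every run of spaces or special characters with one underscore
    name = re.sub(r"[^a-z0-9]+", "_", name)
    name = name.strip("_")
    # Avoid a number at the start of the name (prefix is an arbitrary choice)
    if name and name[0].isdigit():
        name = "x_" + name
    return name


headers = ["Date Surgery", "r_leg-length", " Height (cm) ", "2nd_doctor_name"]
print([clean_name(h) for h in headers])
# ['date_surgery', 'r_leg_length', 'height_cm', 'x_2nd_doctor_name']
```

A function like this fixes case, spaces and symbols, but it cannot decide what a variable should be called. That part is still up to you.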

Be consistent - variable coding

Especially important for categorical variables:

E.g.: a variable in which we are recording handedness (right / left)

right, left, right, right, left, right

1, 2, 1, 1, 2, 1

R, L, R, R, L, R

right, l, Right, r, 1, right

  • choose an option and stick to it
  • some software will require codes (1,2,3,4) instead of strings (right,left)
  • if using codes, make a data dictionary
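
A mixed-up column like the last example above can usually be rescued by mapping every variant to a single code. A minimal sketch, assuming 1 = right and 2 = left; the lookup table `HAND_CODES` and the function `code_handedness` are made up for this example.

```python
# Assumed coding for this sketch: 1 = right, 2 = left
HAND_CODES = {"right": 1, "r": 1, "1": 1, "left": 2, "l": 2, "2": 2}


def code_handedness(value):
    """Map a free-typed handedness entry to its numeric code."""
    key = str(value).strip().lower()
    if key not in HAND_CODES:
        # Fail loudly instead of guessing what the entry means
        raise ValueError(f"Unrecognised handedness value: {value!r}")
    return HAND_CODES[key]


raw = ["right", "l", "Right", "r", "1", "right"]
coded = [code_handedness(v) for v in raw]
print(coded)  # [1, 2, 1, 1, 1, 1]
```

Raising an error for unknown entries is deliberate: a typo should be looked at by a human, not silently turned into a code.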

Be consistent - missing values

White et al, 2013

Be consistent - missing values



id     timepoint   steps   RPE
ft01   1           1294    8
ft01   2           NA      NA
ft02   1           121     3
ft02   2           51231   NA
ft03   1           NA      NA
ft04   1           1653    10
ft04   2           NA      5
ft05   1           12341   3
ft06   1           12521   NA
ft06   2           NA      NA

Be consistent - missing values



id     timepoint   steps   RPE
ft01   1           1294    8
ft01   2
ft02   1           121     3
ft02   2           51231
ft03   1
ft04   1           1653    10
ft04   2                   5
ft05   1           12341   3
ft06   1           12521
ft06   2
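
Whichever missing value you choose, it helps to tell your software about it explicitly when the file is read. The sketch below uses only Python's built-in `csv` module and the first few rows of the table above; treating both blank cells and `NA` as missing is an assumption made for this example.

```python
import csv
import io

# First rows of the table above as CSV text (blank cell = missing)
raw = """id,timepoint,steps,RPE
ft01,1,1294,8
ft01,2,,
ft02,1,121,3
ft02,2,51231,
"""

MISSING = {"", "NA"}  # assumed missing-value markers for this sketch

rows = []
for row in csv.DictReader(io.StringIO(raw)):
    # Convert every missing marker to None so it is handled the same way
    rows.append({k: (None if v.strip() in MISSING else v) for k, v in row.items()})

n_missing = {col: sum(r[col] is None for r in rows) for col in ["steps", "RPE"]}
print(n_missing)  # {'steps': 1, 'RPE': 2}
```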

Be consistent - other advice

  • careful with blank spaces: " male " is not the same as "male"

  • dates can be a pain (especially in Excel!), pick a format and stick to it

    • ISO is YYYY-MM-DD -> 2024-02-16
    • store date as text where possible (Excel won’t mess with it this way)
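
Both problems are easy to handle once the data is out of the spreadsheet. A small sketch, assuming the dates were originally typed as day/month/year; the function name `to_iso` is hypothetical.

```python
from datetime import datetime


def to_iso(date_text, input_format="%d/%m/%Y"):
    """Parse a date typed in a known format and return it as YYYY-MM-DD."""
    return datetime.strptime(date_text.strip(), input_format).date().isoformat()


print(to_iso("16/02/2024"))         # 2024-02-16
print(" male ".strip() == "male")   # True
```

If a date does not match the expected format, `strptime` raises an error, which is a handy way to find entries typed in a different style.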

Be consistent - other advice

  • minimise ‘free text’ unless you need it for your research question

  • keep your file names consistent:

    • bloodmarkers_processed_20231202.csv
    • bloodmarkers_processed_20240104.csv
    • bloodmarkers_processed_20240207.csv
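
File names in this style can be generated instead of typed, which removes one more chance for inconsistency. A minimal sketch; `dated_filename` is a made-up helper, not part of any library.

```python
from datetime import date


def dated_filename(stem, when, ext="csv"):
    """Build a file name of the form stem_YYYYMMDD.ext."""
    return f"{stem}_{when:%Y%m%d}.{ext}"


names = [
    dated_filename("bloodmarkers_processed", date(2024, 2, 7)),
    dated_filename("bloodmarkers_processed", date(2023, 12, 2)),
    dated_filename("bloodmarkers_processed", date(2024, 1, 4)),
]
# With the date written as YYYYMMDD, alphabetical order is also date order
print(sorted(names)[-1])  # bloodmarkers_processed_20240207.csv
```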

Be consistent - version control

  • Good habit for all documents!
  • version number, with or without the date?
  • kneeoapaper_v1_20240216.docx

Make your spreadsheets a rectangle

Broman & Woo 2018

Rectangles :)

Multiple rectangles are ok! As long as they’re linked by identifying variables
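
As a rough illustration of linking two rectangles by an identifying variable, the sketch below joins a participant table to a visits table on `id`. Both tables are invented for this example (the `sex` codes in particular are hypothetical).

```python
# One row per participant
participants = [
    {"id": "ft01", "sex": 1},
    {"id": "ft02", "sex": 2},
]
# One row per participant per timepoint
visits = [
    {"id": "ft01", "timepoint": 1, "steps": 1294},
    {"id": "ft02", "timepoint": 1, "steps": 121},
    {"id": "ft02", "timepoint": 2, "steps": 51231},
]

# Index participants by id, then attach their details to every visit.
# A KeyError here means a visit has an id with no matching participant.
by_id = {p["id"]: p for p in participants}
linked = [{**by_id[v["id"]], **v} for v in visits]

print(linked[2])  # {'id': 'ft02', 'sex': 2, 'timepoint': 2, 'steps': 51231}
```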

Wide form vs Long form
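
In wide form each participant has one row, with a separate column for each timepoint. In long form each row is one participant at one timepoint. The sketch below converts a small wide table to long form; the column names `steps_t1` and `steps_t2` are assumptions made for this example.

```python
# Wide form: one row per participant, one column per timepoint
wide = [
    {"id": "ft01", "steps_t1": 1294, "steps_t2": None},
    {"id": "ft02", "steps_t1": 121, "steps_t2": 51231},
]

# Long form: one row per participant per timepoint
long_rows = []
for row in wide:
    for timepoint in (1, 2):
        long_rows.append(
            {
                "id": row["id"],
                "timepoint": timepoint,
                "steps": row[f"steps_t{timepoint}"],
            }
        )

print(len(long_rows))  # 4
print(long_rows[1])    # {'id': 'ft01', 'timepoint': 2, 'steps': None}
```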

Tidy Data

Store one piece of information per cell


Country            Name             Cases
Afghanistan 1999   John Smith       2523035
Afghanistan 2000   Julia Proud      23428
Norway 2003        Holga Svensson   60123
Norway 2005        Erik Bryans      1012959
Germany 1999       Klaus Schmidt    912509
Germany 2005       Sofia Ellins     12093


Country       Year   First Name   Surname    Cases
Afghanistan   1999   John         Smith      2523035
Afghanistan   2000   Julia        Proud      23428
Norway        2003   Holga        Svensson   60123
Norway        2005   Erik         Bryans     1012959
Germany       1999   Klaus        Schmidt    912509
Germany       2005   Sofia        Ellins     12093
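
Going from the first table to the second can be scripted. This is only a sketch: it assumes the year is always the last word in the combined cell and that every name is exactly one first name followed by a surname, which real data often will not respect.

```python
# Two rows from the untidy table: country + year and full name share a cell
messy = [
    {"Country": "Afghanistan 1999", "Name": "John Smith", "Cases": "2523035"},
    {"Country": "Norway 2003", "Name": "Holga Svensson", "Cases": "60123"},
]

tidy = []
for row in messy:
    country, year = row["Country"].rsplit(" ", 1)      # split off the last word
    first_name, surname = row["Name"].split(" ", 1)    # split at the first space
    tidy.append(
        {
            "Country": country,
            "Year": int(year),
            "First Name": first_name,
            "Surname": surname,
            "Cases": int(row["Cases"]),
        }
    )

print(tidy[0]["Country"], tidy[0]["Year"])  # Afghanistan 1999
```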

Spreadsheets - what not to do!

Don’t leave blank spaces

  • if it’s missing, use your missing value
  • if it’s not rectangular, make it a rectangle

Don’t use colouring, bolding, highlighting, field commenting ⛳️ etc

  • many programs won’t even be able to read these fields
  • if you are signalling extra information by highlighting, instead add it as another column

No graphs or formulas in spreadsheets!

Your turn!!

Clean up this spreadsheet:

Link to file

Data Dictionary

Describe what the variables are, and how they are coded

For you and for the future researcher!


variable_name   name/description             coding                                           notes
id              Patient ID
sex             sex at birth                 1 = female; 2 = male                             self reported
dom_hand        dominant hand                1 = right; 2 = left
height          height in cm                 numeric
highest_ed      highest education achieved   1 = Year 10; 2 = VCE; 3 = TAFE; 4 = University
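
A data dictionary can also be used to check the data itself. The sketch below writes the coded variables from the table above as a Python dictionary and lists any value that is not an allowed code; the function `invalid_codes` and the two example rows are hypothetical.

```python
# Allowed codes for each categorical variable, taken from the data dictionary
DATA_DICTIONARY = {
    "sex": {1: "female", 2: "male"},
    "dom_hand": {1: "right", 2: "left"},
    "highest_ed": {1: "Year 10", 2: "VCE", 3: "TAFE", 4: "University"},
}


def invalid_codes(rows, dictionary):
    """Return (row index, variable, value) for every value not in the dictionary."""
    problems = []
    for i, row in enumerate(rows):
        for var, codes in dictionary.items():
            value = row.get(var)
            # None is treated as a missing value here, not as an invalid code
            if value is not None and value not in codes:
                problems.append((i, var, value))
    return problems


# Made-up rows for illustration
rows = [
    {"id": "p01", "sex": 1, "dom_hand": 2, "highest_ed": 4},
    {"id": "p02", "sex": 3, "dom_hand": "R", "highest_ed": 2},
]
print(invalid_codes(rows, DATA_DICTIONARY))
# [(1, 'sex', 3), (1, 'dom_hand', 'R')]
```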

Your turn!!

Make a data dictionary for the file you’ve been working on

Data storage

  • Adhere to your data management plan!
    • Store in a secure place
    • Be careful with anything potentially identifiable
  • Use formats that are open and friendly
    • .xlsx files are designed for Microsoft Excel, which not everyone has (though they are still probably OK to use)
    • .csv files are universally readable, efficient etc

Data storage

  • Don’t ever work on a master data file.
    • Once your data is cleaned, store it
    • Conduct your analysis on a copy of it, containing only the information you need
flowchart TD
A[Raw data\nsource 1] --> B(Data Cleaning)
C[Raw data\nsource 2] --> B
D[Raw data\nsource 3] --> B
B --> E{Cleaned Database}
E --> F(Analysis 1)
E --> G(Analysis 2)
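
In practice, "never work on the master file" can be as simple as copying it into a separate analysis folder first. A minimal sketch using only the Python standard library; the function `make_working_copy`, the file name and the folder name are all hypothetical, and the demonstration runs in a temporary folder.

```python
import shutil
import tempfile
from pathlib import Path


def make_working_copy(master, analysis_dir):
    """Copy the master file into an analysis folder and return the copy's path."""
    analysis_dir = Path(analysis_dir)
    analysis_dir.mkdir(parents=True, exist_ok=True)
    working = analysis_dir / f"working_{Path(master).name}"
    shutil.copy2(master, working)
    return working


# Demonstration in a temporary folder with a hypothetical file name
with tempfile.TemporaryDirectory() as tmp:
    master = Path(tmp) / "cleaned_database.csv"
    master.write_text("id,steps\nft01,1294\n")

    working = make_working_copy(master, Path(tmp) / "analysis_1")
    working.write_text("id,steps\nft01,9999\n")  # edits only touch the copy

    master_unchanged = master.read_text() == "id,steps\nft01,1294\n"

print(master_unchanged)  # True
```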


Back-ups - MAKE THEM!



  • 3 different copies of data 📄 📄 📄
  • 2 different media 💻 💾
  • 1 off site ☁️

Back-ups - Example:


  • Personal laptop
  • Backup hard-drive
  • La Trobe cloud file storage


  • 3 copies ✅✅✅
  • 2 different media ✅✅
  • 1 offsite/accessible ✅
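
A backup only helps if it matches the original. One common way to check this is to compare checksums of the files. The sketch below is an illustration using Python's `hashlib`; the function name `sha256_of` and the file names are invented, and the demonstration writes its files to a temporary folder.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration with made-up files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "data.csv"
    good_backup = Path(tmp) / "backup_good.csv"
    bad_backup = Path(tmp) / "backup_bad.csv"
    original.write_bytes(b"id,steps\nft01,1294\n")
    good_backup.write_bytes(b"id,steps\nft01,1294\n")
    bad_backup.write_bytes(b"id,steps\nft01,1295\n")  # one digit differs

    result = [sha256_of(p) == sha256_of(original) for p in (good_backup, bad_backup)]

print(result)  # [True, False]
```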

Useful references

Thank you!

Questions: m.girdwood@latrobe.edu.au